In [144]:
# ignore warning message if cannot open file
# used for formatting slides, not needed for notes
try(source("../startup.R"))

STATS 306: Introduction to Statistical Computing

Administrative stuff

Person                   Uniqname  Office Hours  Location
Prof. Jonathan Terhorst  jonth     Tu 2-4pm      269 West Hall
Byoung Jang              bwjang    TBD           SLC
Kidus Asfaw              kasfaw    TBD           SLC
Luke Puglisi             lpuglisi  TBD           SLC

The book

We will closely follow the book "R for Data Science" (R4DS) by Hadley Wickham and Garrett Grolemund. The electronic version is available for free at http://r4ds.had.co.nz/. There is no need to purchase the hardcopy version unless you enjoy spending money.

Accessing an R programming environment

Everything in this course will be done using Jupyter notebooks running the R programming language.

Using the cloud

The easiest way to get up and running in this environment is by using a cloud-based service. I recommend try.jupyter.org or Microsoft's Azure notebook service. Both are free and should provide all the computational resources you need for this course.

Lecture notes will be distributed in Jupyter notebook format before lecture. You are encouraged to bring your laptop to lecture and follow along.

RStudio

Another popular option is RStudio.

You are free to use whatever environment you please, but lectures and assignments will be done using Jupyter notebook.

What this course is not about

This is not a traditional programming course. You will learn to program in R as a byproduct of learning how to visualize, clean, and model data. However, we will not cover topics like:

  • Algorithms
  • Data structures
  • OOP
  • etc.

If you find that you enjoy programming and want to go further, these would be good topics to learn about in a future course.

Today's lecture

Today's lecture will be a whirlwind tour of some of the major topics to be covered in this course. Don't worry if you don't understand everything. We will cover all of these topics in much more detail later.

Goals for today's lecture

  • Learn how to load the Jupyter notebooks and the R programming environment on your computer.
  • Understand the basic building blocks of a data analysis
  • Learn about the homework submission system

The practice of data science

(Figure: the data science exploration workflow from R4DS: import, tidy, then an iterative cycle of transform, visualize, and model, ending with communicate.)

Tidy data

There are many different ways to represent data in a table, but some are better than others. We say that a data table is "tidy" if:

  • Each row represents an observation.
  • Each column represents a variable.
  • Each different type of data set gets its own table.

Data tables which are not tidy are called messy!

A messy data set

Here is an example of a messy data set:

In [145]:
load(url("https://github.com/terhorst/stats306/raw/master/lecture00/messy_tidy.RData"))
In [146]:
messy
religion                  <$10k  $10-20k  $20-30k  $30-40k  $40-50k  $50-75k
Agnostic                     27       34       60       81       76      137
Atheist                      12       27       37       52       35       70
Buddhist                     27       21       30       34       33       58
Catholic                    418      617      732      670      638     1116
Don’t know/refused           15       14       15       11       10       35
Evangelical Prot            575      869     1064      982      881     1486
Hindu                         1        9        7        9       11       34
Historically Black Prot     228      244      236      238      197      223
Jehovah's Witness            20       27       24       24       21       30
Jewish                       19       19       25       25       30       95

Note that messy data is sometimes preferable to the tidy representation:

  • It can be more compact and readable.
  • It can take less space to store.
  • It can be easier to enter.

The term "messy" is borrowed from R4DS. It simply means that the data are not in the best shape for analysis with statistical software.

A tidy data set

Here is the same data in tidy form:

In [147]:
head(tidy)
religion  income   freq
Agnostic  <$10k      27
Agnostic  $10-20k    34
Agnostic  $20-30k    60
Agnostic  $30-40k    81
Agnostic  $40-50k    76
Agnostic  $50-75k   137

Compared to the messy version:

  • Income, which used to be spread across many columns, is now gathered into the second column.
  • Each combination of religion and income category gets its own row.
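To see what the conversion from messy to tidy looks like in code, here is a minimal sketch using a small made-up table (not the real survey data) and tidyr's gather() function:

```r
library(tidyverse)

# A tiny made-up table in messy form: income categories spread across columns
messy_small <- tibble(
  religion  = c("Agnostic", "Atheist"),
  `<$10k`   = c(27, 12),
  `$10-20k` = c(34, 27)
)

# Gather the income columns into a key column (income) and a value column (freq)
tidy_small <- gather(messy_small, key = "income", value = "freq", -religion)
print(tidy_small)
```

Each of the four religion/income combinations now occupies its own row.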

The tidyverse

The R commands used in the book are collectively known as the tidyverse. They are called that because they expect tidy data as input, and (where necessary) they return tidy data as output.

All of the code examples will assume that you have loaded this package. The way to load packages in R is by using the library() command:

In [148]:
# install.packages('tidyverse') if necessary. (Not needed in cloud environments.)
library(tidyverse)

Data Visualization

Humans are much better than computers at recognizing patterns. Consequently, the first step in most data science projects is to visualize the data. Let's examine a standard R dataset called mpg containing gas mileage data for various makes of automobile.

First, the raw data:

In [149]:
print(mpg)
# A tibble: 234 x 11
   manufacturer model   displ  year   cyl trans   drv     cty   hwy fl    class
   <chr>        <chr>   <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
 1 audi         a4       1.80  1999     4 auto(l… f        18    29 p     comp…
 2 audi         a4       1.80  1999     4 manual… f        21    29 p     comp…
 3 audi         a4       2.00  2008     4 manual… f        20    31 p     comp…
 4 audi         a4       2.00  2008     4 auto(a… f        21    30 p     comp…
 5 audi         a4       2.80  1999     6 auto(l… f        16    26 p     comp…
 6 audi         a4       2.80  1999     6 manual… f        18    26 p     comp…
 7 audi         a4       3.10  2008     6 auto(a… f        18    27 p     comp…
 8 audi         a4 qua…  1.80  1999     4 manual… 4        18    26 p     comp…
 9 audi         a4 qua…  1.80  1999     4 auto(l… 4        16    25 p     comp…
10 audi         a4 qua…  2.00  2008     4 manual… 4        20    28 p     comp…
# ... with 224 more rows

As we can see, this data frame (actually a tibble, a newer type of data frame) has 234 observations (rows) and 11 variables (columns). Only the first 10 rows are displayed above.
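We can also query these sizes directly rather than reading them off the printout; a quick sketch (assuming the tidyverse is loaded so that mpg is available):

```r
library(tidyverse)

dim(mpg)    # rows and columns: 234 11
nrow(mpg)   # 234 observations
ncol(mpg)   # 11 variables
```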

Scatter plotting

The city mileage cty and highway mileage hwy should be correlated. Let us plot them.

In [150]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy))

As expected, cars with higher highway mileage also tend to have higher city mileage. Let us color the points in the above plot by the class of the vehicle.

In [151]:
ggplot(data = mpg) +
    geom_point(mapping = aes(x = cty, y = hwy, color = class))

We see that compact and subcompact cars have the highest mileage whereas SUVs and pickup trucks have the lowest.

Data Transformation

Let us load the nycflights13 dataset, which has information about all flights that departed the New York area (airport codes JFK, EWR, LGA) in 2013.

In [152]:
library(nycflights13)  ## you may need to install.packages() this
print(flights)
# A tibble: 336,776 x 19
    year month   day dep_t… sched… dep_d… arr_… sched… arr_d… carr… flig… tail…
   <int> <int> <int>  <int>  <int>  <dbl> <int>  <int>  <dbl> <chr> <int> <chr>
 1  2013     1     1    517    515   2.00   830    819  11.0  UA     1545 N142…
 2  2013     1     1    533    529   4.00   850    830  20.0  UA     1714 N242…
 3  2013     1     1    542    540   2.00   923    850  33.0  AA     1141 N619…
 4  2013     1     1    544    545  -1.00  1004   1022 -18.0  B6      725 N804…
 5  2013     1     1    554    600  -6.00   812    837 -25.0  DL      461 N668…
 6  2013     1     1    554    558  -4.00   740    728  12.0  UA     1696 N394…
 7  2013     1     1    555    600  -5.00   913    854  19.0  B6      507 N516…
 8  2013     1     1    557    600  -3.00   709    723 -14.0  EV     5708 N829…
 9  2013     1     1    557    600  -3.00   838    846 - 8.00 B6       79 N593…
10  2013     1     1    558    600  -2.00   753    745   8.00 AA      301 N3AL…
# ... with 336,766 more rows, and 7 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We have information about 336,776 flights. Let us first get a smaller, more manageable dataset by looking at flights only in January.

In [153]:
jan_flights = filter(flights, month == 1)

Let us find flights that had a departure delay of more than 1 hour.

In [154]:
print(filter(jan_flights, dep_delay > 60))
# A tibble: 1,821 x 19
    year month   day dep_t… sched_… dep_d… arr_… sched… arr_… carr… flig… tail…
   <int> <int> <int>  <int>   <int>  <dbl> <int>  <int> <dbl> <chr> <int> <chr>
 1  2013     1     1    811     630  101    1047    830 137   MQ     4576 N531…
 2  2013     1     1    826     715   71.0  1136   1045  51.0 AA      443 N3GV…
 3  2013     1     1    848    1835  853    1001   1950 851   MQ     3944 N942…
 4  2013     1     1    957     733  144    1056    853 123   UA      856 N534…
 5  2013     1     1   1114     900  134    1447   1222 145   UA     1086 N765…
 6  2013     1     1   1120     944   96.0  1331   1213  78.0 EV     4495 N165…
 7  2013     1     1   1301    1150   71.0  1518   1345  93.0 MQ     4646 N542…
 8  2013     1     1   1337    1220   77.0  1649   1531  78.0 B6      673 N636…
 9  2013     1     1   1400    1250   70.0  1645   1502 103   EV     4869 N748…
10  2013     1     1   1505    1310  115    1638   1431 127   EV     4497 N179…
# ... with 1,811 more rows, and 7 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Let us sort the January flights by departure delays, longest delays first.

In [155]:
print(arrange(jan_flights, desc(dep_delay)))
# A tibble: 27,004 x 19
    year month   day dep_t… sched_… dep_d… arr_… sched… arr_… carr… flig… tail…
   <int> <int> <int>  <int>   <int>  <dbl> <int>  <int> <dbl> <chr> <int> <chr>
 1  2013     1     9    641     900   1301  1242   1530  1272 HA       51 N384…
 2  2013     1    10   1121    1635   1126  1239   1810  1109 MQ     3695 N517…
 3  2013     1     1    848    1835    853  1001   1950   851 MQ     3944 N942…
 4  2013     1    13   1809     810    599  2054   1042   612 DL      269 N322…
 5  2013     1    16   1622     800    502  1911   1054   497 B6      517 N661…
 6  2013     1    23   1551     753    478  1812   1006   486 DL     2119 N326…
 7  2013     1    10   1525     900    385  1713   1039   394 UA      544 N419…
 8  2013     1     1   2343    1724    379   314   1938   456 EV     4321 N211…
 9  2013     1     2   2131    1512    379  2340   1741   359 UA      488 N593…
10  2013     1     7   2021    1415    366  2332   1724   368 B6      377 N789…
# ... with 26,994 more rows, and 7 more variables: origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

We see that the most delayed flight was delayed by 1301 minutes. That's more than 21 hours! We also see that the rows at the bottom all have NA as the value of the variable dep_delay. That's how missing values are represented in R.
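Those missing values can be counted with is.na(); a short sketch, assuming the nycflights13 package is installed:

```r
library(nycflights13)

# dep_delay is NA when no departure was recorded, presumably a cancelled flight
n_missing <- sum(is.na(flights$dep_delay))
n_missing
```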

Let us find out what were the average delays for different months. You will notice two new things below.

  1. The assignment operator in R is traditionally written <-. Some people prefer =, which is what most other languages use. Both work, and which one you choose is a matter of personal preference.
  2. Enclosing an assignment in parentheses causes the assignment to be made and the assigned value to be printed.
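Both points can be seen in a quick sketch (the variable names here are made up):

```r
x <- 5        # assigned silently; nothing is printed
x = 5         # equivalent assignment using =
(y <- x + 1)  # the parentheses make R print the assigned value, 6
```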
In [156]:
by_month <- group_by(flights, year, month)
(monthly_averages <- summarise(by_month, delay = mean(dep_delay, na.rm = TRUE)))
year  month      delay
2013      1  10.036665
2013      2  10.816843
2013      3  13.227076
2013      4  13.938038
2013      5  12.986859
2013      6  20.846332
2013      7  21.727787
2013      8  12.611040
2013      9   6.722476
2013     10   6.243988
2013     11   5.435362
2013     12  16.576688
In [157]:
ggplot(data = monthly_averages) +
    geom_bar(mapping = aes(x = month, y = delay), stat = 'identity') +
    scale_x_continuous(breaks = seq(1, 12))

Exploratory Data Analysis

Let us now look at diamonds, a dataset containing the prices and other information about almost 54,000 diamonds.

In [158]:
print(diamonds)
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.230 Ideal     E     SI2      61.5  55.0   326  3.95  3.98  2.43
 2 0.210 Premium   E     SI1      59.8  61.0   326  3.89  3.84  2.31
 3 0.230 Good      E     VS1      56.9  65.0   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4  58.0   334  4.20  4.23  2.63
 5 0.310 Good      J     SI2      63.3  58.0   335  4.34  4.35  2.75
 6 0.240 Very Good J     VVS2     62.8  57.0   336  3.94  3.96  2.48
 7 0.240 Very Good I     VVS1     62.3  57.0   336  3.95  3.98  2.47
 8 0.260 Very Good H     SI1      61.9  55.0   337  4.07  4.11  2.53
 9 0.220 Fair      E     VS2      65.1  61.0   337  3.87  3.78  2.49
10 0.230 Very Good H     VS1      59.4  61.0   338  4.00  4.05  2.39
# ... with 53,930 more rows

Let us try to understand the relationship between the price of a diamond and its weight in carats.

In [159]:
ggplot(data = diamonds) +
    geom_point(mapping = aes(x = carat, y = price))

This graphic suffers from overplotting: there are so many data points that they coalesce into a black blob, hindering interpretation.

We can switch to the geom_bin2d geometry, which divides the plane into rectangular bins and uses a fill color to show how many points landed in each bin.

In [160]:
ggplot(data = diamonds) + 
  geom_bin2d(mapping = aes(x = carat, y = price))

We can also choose hexagonal bins.

In [161]:
ggplot(data = diamonds) + 
  geom_hex(mapping = aes(x = carat, y = price))

We can also use geom_smooth to fit a smooth curve showing how price varies as a function of weight in carats.

In [162]:
ggplot(data = diamonds) + 
  geom_smooth(mapping = aes(x = carat, y = price))
`geom_smooth()` using method = 'gam'

What about the relationship between price and cut? Since cut is a categorical variable, let us use a boxplot using the geom_boxplot geometry.

In [163]:
ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(x = cut, y = price))

Hmmm. That looks strange. We might expect price to go up as cut quality increases, but here the lowest quality diamonds seem to be the most expensive! This is because fair quality diamonds also tend to be larger, and larger diamonds tend to be more expensive. Let's plot carat versus cut.

In [164]:
ggplot(data = diamonds) +
    geom_boxplot(mapping = aes(x = cut, y = carat))

As you can see, untangling the relationship between even just 3 variables can be hard. We haven't even looked at two other "C's" yet -- color and clarity!

Strings

Strings are sequences of characters and store textual data. Many data sets contain strings that you will need to manipulate in order to extract usable data. We will use the stringr package to work with strings in R.

In [165]:
library(stringr)

All string manipulation functions in stringr start with str_. Here's how to compute the length of a string.

In [166]:
str_length("This is a string.")
17

str_c can be used to join, or concatenate, strings.

In [167]:
str_c("Birds of a feather","flock together.")
'Birds of a featherflock together.'

Oops, that didn't put a space in between. We can add it using the sep argument.

In [168]:
str_c("Birds of a feather","flock together.", sep=" ")
'Birds of a feather flock together.'

We can sort strings in alphabetic order.

In [169]:
data_science_languages = c("R", "Python", "Scala", "Julia")
str_sort(data_science_languages)
  1. 'Julia'
  2. 'Python'
  3. 'R'
  4. 'Scala'

We can look for patterns in strings using str_view. For example, let us try to find a very simple pattern -- the letter "a" -- in the language names above.

In [170]:
str_view(data_science_languages, "a")

str_view matches only the first occurrence of a pattern. To match all occurrences, we can use str_view_all.

In [171]:
str_view_all(data_science_languages, "a")

To find a pattern only at the end of a string, we can use the anchor "$" that matches the end of a string.

In [172]:
str_view(data_science_languages, "a$")

If we only want to find out whether a pattern matches a string, we can use str_detect. In the code below, [aeiou] is a character class that matches any one of the letters inside the brackets, in this case the 5 vowels of the English alphabet.

In [173]:
str_detect(data_science_languages, "[aeiou]")
  1. FALSE
  2. TRUE
  3. TRUE
  4. TRUE

(Aside: [aeiou] is an example of a regular expression. We'll talk about these more later in the course.)
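A related stringr function, str_count(), counts how many times a pattern matches in each string. For example, counting the vowels in each language name (an extra illustration, not from R4DS):

```r
library(stringr)

data_science_languages <- c("R", "Python", "Scala", "Julia")
# Count how many lowercase vowels appear in each name
str_count(data_science_languages, "[aeiou]")  # 0 1 2 3
```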

Dates and Times

Dates are another common type of data that we will encounter. The lubridate package helps us work with dates and times.

In [174]:
library(lubridate)

Here's how to get today's date as a date object and the current time as a date-time object.

In [175]:
today()
now()
[1] "2018-01-03 12:18:49 EST"

Let's convert the time data in the flights data set to a proper date-time representation. We'll use the select() and mutate() commands to pick out only the columns pertaining to time and create date-time objects. (Here we are using the pipe operator %>%, which we will discuss later.)

In [176]:
flights_dt = flights %>% 
             select(year, month, day, hour, minute) %>% 
             mutate(departure = make_datetime(year, month, day, hour, minute))
print(flights_dt)
# A tibble: 336,776 x 6
    year month   day  hour minute departure          
   <int> <int> <int> <dbl>  <dbl> <dttm>             
 1  2013     1     1  5.00   15.0 2013-01-01 05:15:00
 2  2013     1     1  5.00   29.0 2013-01-01 05:29:00
 3  2013     1     1  5.00   40.0 2013-01-01 05:40:00
 4  2013     1     1  5.00   45.0 2013-01-01 05:45:00
 5  2013     1     1  6.00    0   2013-01-01 06:00:00
 6  2013     1     1  5.00   58.0 2013-01-01 05:58:00
 7  2013     1     1  6.00    0   2013-01-01 06:00:00
 8  2013     1     1  6.00    0   2013-01-01 06:00:00
 9  2013     1     1  6.00    0   2013-01-01 06:00:00
10  2013     1     1  6.00    0   2013-01-01 06:00:00
# ... with 336,766 more rows

We can now plot a histogram of flight counts by departure time, using a bin width of one day.

In [177]:
ggplot(data = flights_dt) +
  geom_histogram(mapping=aes(x = departure), binwidth=24*60*60)  # bin width for date-time is in seconds

What do those dips correspond to? Let us look more closely at data only from January.

In [178]:
flights_dt_jan = filter(flights_dt, departure < ymd(20130201))
ggplot(data = flights_dt_jan) +
  geom_histogram(mapping=aes(x = departure), binwidth=24*60*60)  # bin width for date-time is in seconds

We see fewer flights on Jan 6, 13, 20, 27. These must be Sundays. Let us check.

In [179]:
wday(ymd(20130106), label=TRUE, abbr=FALSE)
Sunday
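Since lubridate functions are vectorized, we can check all four dates at once (the printed labels may vary with your locale):

```r
library(lubridate)

dips <- ymd(c(20130106, 20130113, 20130120, 20130127))
wday(dips, label = TRUE, abbr = FALSE)  # all four dates are Sundays
```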

Functions

Functions are the most basic mechanism for code reuse. If you find yourself copying and pasting code more than twice, you probably want to think about writing a function. Then, later changes only need to be done in one place and not in all the many places where you copied your code.

In [180]:
say_hello <- function(x) {
    str_c("Hello ", x, "!")
}
say_hello("World")
say_hello("STATS 306")
'Hello World!'
'Hello STATS 306!'

say_hello is the name of our function. x is the argument to the function and the code between the curly brackets { and } is the body of the function.

Let's see what happens when we don't provide an argument.

In [181]:
say_hello()
Error in str_c("Hello ", x, "!"): argument "x" is missing, with no default
Traceback:

1. say_hello()
2. str_c("Hello ", x, "!")   # at line 2 of file <text>
3. stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE)

We can supply a default value for the argument, as the code below shows.

In [ ]:
say_hello <- function(x = "there") {
    str_c("Hello ", x, "!")
}

If we supply the argument, the function works as before.

In [ ]:
say_hello('friends')

If we don't, it uses the default argument.

In [ ]:
say_hello()

Let's see what happens when we pass along the empty string "" as an argument to say_hello.

In [ ]:
say_hello("")

Perhaps we don't like the space between "Hello" and "!" in this case. So we will add a check to see if the argument is an empty string.

In [ ]:
say_hello <- function(x = "there") {
    if (str_length(x) == 0) {
        "Hello!"
    } else {
        str_c("Hello ", x, "!")
    }
}
say_hello("")

We just saw an instance of conditional execution of code using the if statement.
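As another small illustration of if/else (a made-up example; the function name parity is our own):

```r
parity <- function(n) {
    if (n %% 2 == 0) {
        "even"
    } else {
        "odd"
    }
}
parity(4)  # 'even'
parity(7)  # 'odd'
```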

Vectors

R has two types of vectors:

Atomic vectors: These are homogeneous in the sense that every element is of the same type. For example, logical, integer, double, character.

Lists: These are heterogeneous in the sense that different elements can be of different types. In particular, a list can contain another list.
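A short sketch of the difference (the element values here are made up):

```r
# An atomic vector: every element is coerced to a single type (character here)
atomic <- c(1, "two", TRUE)
typeof(atomic)  # 'character'

# A list keeps each element's type, and can even nest another list
my_list <- list(1L, "two", TRUE, list(3, 4))
length(my_list)  # 4
```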

In [ ]:
(my_lgl_vector <- 1:10 %% 2 == 1) # TRUE if odd, FALSE if even

Vectors are very useful in R. We can query their type, length, and apply mathematical operations across every element of the vector simultaneously:

In [ ]:
typeof(my_lgl_vector)
In [ ]:
length(my_lgl_vector)
In [ ]:
(my_dbl_vector <- 1:10 / 100)
In [ ]:
sqrt(my_dbl_vector)

What happens if we apply an identity operation?

In [ ]:
sqrt(my_dbl_vector) ^ 2 == my_dbl_vector

This does not do what we expect because of the inherent imprecision of floating point arithmetic. However, we can check that the two vectors are near() each other to within a tiny tolerance:

In [ ]:
near(sqrt(my_dbl_vector) ^ 2, my_dbl_vector)
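The same issue shows up with simple decimal arithmetic; a classic example (near() comes from the dplyr package, part of the tidyverse):

```r
library(dplyr)

0.1 + 0.2 == 0.3      # FALSE: binary floating point cannot store 0.1 exactly
near(0.1 + 0.2, 0.3)  # TRUE: compares within a small tolerance
```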

Iteration

Iteration is an important concept in programming. It refers to doing the same operations repeatedly. Let us consider the famous Fibonacci sequence, whose terms are defined by the recurrence

$$ F(n+1) = F(n) + F(n-1) $$

starting with $F(1) = 0$ and $F(2) = 1$.

The code below computes the first 10 Fibonacci numbers using a for loop.

In [ ]:
previous = 0
current = 1
for (i in 1:10) {
    print(previous)
    new = current + previous
    previous = current
    current = new
}

What if we want to print all Fibonacci numbers less than 1000? We do not know in advance how many iterations that will take. To repeat a computation as long as some condition holds, we can use a while loop.

In [ ]:
previous = 0
current = 1
while (previous < 1000) {
    print(previous)
    new = current + previous
    previous = current
    current = new
}
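To tie the last two sections together, we can wrap the while loop in a function so that the limit becomes an argument (a sketch; fib_upto is our own made-up name):

```r
fib_upto <- function(limit) {
    result <- c()
    previous <- 0
    current <- 1
    while (previous < limit) {
        result <- c(result, previous)  # collect instead of printing
        new <- current + previous
        previous <- current
        current <- new
    }
    result
}
fib_upto(1000)  # 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
```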